library(tidyverse)
library(readxl)
path <- "Excel/800-899/865/865 Complex Regex.xlsx"
input <- read_excel(path, range = "A2:A95")
test <- read_excel(path, range = "C2:E95")
result <- input %>%
mutate(
Date = ymd(str_extract(Data, "[0-9]{2,4}[/|-][0-9]{1,2}[/|-][0-9]{1,2}")),
`Product ID` = str_extract(Data, "[A-Z]{2}[0-9]{3}"),
`Weight in Kg` = {
w <- str_extract(
Data,
"(kg|kgs|gm|gms)\\s*[0-9]+\\.?[0-9]*|[0-9]+\\.?[0-9]*\\s*(kg|kgs|gm|gms)"
)
val <- as.numeric(str_extract(w, "[0-9]+\\.?[0-9]*"))
unit <- str_extract(w, "kg|kgs|gm|gms")
case_when(
unit %in% c("gm", "gms") ~ val / 1000,
unit %in% c("kg", "kgs") ~ val,
TRUE ~ NA_real_
)
}
) %>%
select(-Data)
all.equal(result, test, check.attributes = FALSE)
# [1] TRUEExcel BI - Excel Challenge 865
excel-challenges
excel-formulas
🔰 Many legacy machines are dumping logs where the data position varies wildly.

Challenge Description
🔰 Many legacy machines are dumping logs where the data position varies wildly. The combined log is given. Extract following -
Solutions
- Logic: Read the workbook ranges needed for the challenge; Derive the required intermediate columns; Parse the packed text or string structure; Apply the business rule conditions explicitly.
- Strengths: The solution stays close to the text pattern itself, which makes the extraction logic easy to audit.
- Areas for Improvement: The solution assumes the workbook layout and selected ranges remain stable, so any structural change in the sheet would require small adjustments.
- Gem: A small number of well-targeted text patterns does most of the heavy lifting.
import pandas as pd
import re
import numpy as np
path = "Excel/800-899/865/865 Complex Regex.xlsx"
input = pd.read_excel(path, usecols="A", skiprows=1, nrows=93)
test = pd.read_excel(path, usecols="C:E", skiprows=1, nrows=93)
def extract_date(s):
m = re.search(
r"(24|2024)[-/](\d{1,2})[-/](\d{1,2})",
s or ""
)
return pd.to_datetime(m.group(0), errors="coerce", dayfirst=False, yearfirst=True) if m else pd.NaT
def extract_product_id(s):
m = re.search(r"[A-Z]{2}[0-9]{3}", s or "")
return m.group(0) if m else None
def extract_weight_kg(s):
m = re.search(r"(kg|kgs|gm|gms)\s*[0-9]+\.?[0-9]*|[0-9]+\.?[0-9]*\s*(kg|kgs|gm|gms)", s or "")
if not m: return np.nan
v = re.search(r"[0-9]+\.?[0-9]*", m.group(0))
u = re.search(r"kg|kgs|gm|gms", m.group(0))
if not v or not u: return np.nan
val, unit = float(v.group(0)), u.group(0)
return val/1000 if unit in ("gm","gms") else val if unit in ("kg","kgs") else np.nan
result = pd.DataFrame({
"Date": input.iloc[:,0].apply(extract_date),
"Product ID": input.iloc[:,0].apply(extract_product_id),
"Weight in Kg": input.iloc[:,0].apply(extract_weight_kg)
})
result["Weight in Kg"] = result["Weight in Kg"].astype("int64")
print(result.equals(test)) #The Python version expresses the core extraction rule directly and keeps the pattern matching easy to review.
Difficulty Level
Medium
The individual steps are manageable, but the correct transformation pattern is not obvious from the raw data.